
CI: kill leaf jobs when aborting stale PR dispatchers #11491

Draft
Alexey-Rivkin wants to merge 1 commit into openucx:master from Alexey-Rivkin:ci/abort-stale-with-leaves

@Alexey-Rivkin Alexey-Rivkin commented May 26, 2026

What?

Make stale-dispatcher cleanup also kill the leaf jobs it spawned, not just the dispatcher.

Why?

Calling doStop() on the dispatcher leaves its 8 child jobs (ucx-build-oss, ucx-test-gpu-oss, etc.) running until they finish naturally, which wastes executors on every PR push.

How?

Scrape the stale dispatcher's console log for Starting building: ... #N lines to find its leaf jobs, doKill() the leaves first and then the dispatcher, and retry up to 3 times with a 5s gap.

@Alexey-Rivkin

Contributor Author

/build

@Alexey-Rivkin

Contributor Author

@NirWolfer

@Alexey-Rivkin Alexey-Rivkin requested a review from dpressle May 26, 2026 13:15
@Alexey-Rivkin Alexey-Rivkin force-pushed the ci/abort-stale-with-leaves branch from 3c2a30a to 2abcb24 Compare May 26, 2026 13:26
@Alexey-Rivkin

Contributor Author

/build

@Alexey-Rivkin

Contributor Author

/build

@Alexey-Rivkin

Contributor Author

/build

The OSS dispatcher already aborts the prior dispatcher build for the same
PR, but the 8 leaf jobs it spawned via `build job: ..., wait: true` keep
running until they finish naturally. That ties up build executors every
time someone pushes a new commit to a PR.

Port the pattern NIXL uses: scrape each stale dispatcher's console log
for `Starting building: <name> #<num>` lines to find its children, kill
the leaves first so the dispatcher's wait unblocks, then kill the
dispatcher. Use `doKill()` instead of `doStop()` for a hard stop, and
retry up to 3 times with a 5s gap so builds caught mid-startup still get
torn down.
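
The flow described above can be sketched in Python for illustration. This is not the Groovy code in this PR: the helper names (`find_leaf_builds`, `kill_order`, `kill_with_retry`) and the `is_running` / `kill` callbacks are hypothetical stand-ins for whatever the pipeline uses to query and `doKill()` a build. The sketch also assumes each console line carries exactly the `Starting building: <name> #<num>` text with no timestamp prefix.

```python
import re
import time
from typing import Callable, List, Tuple

Build = Tuple[str, int]  # (job name, build number)

# Matches console lines such as "Starting building: ucx-build-oss #12".
_LEAF_RE = re.compile(
    r"^Starting building: (?P<name>.+?) #(?P<num>\d+)[ \t\r]*$",
    re.MULTILINE,
)


def find_leaf_builds(console_text: str) -> List[Build]:
    """Return the child builds named in a dispatcher's console log, in order."""
    found: List[Build] = []
    for match in _LEAF_RE.finditer(console_text):
        build = (match.group("name"), int(match.group("num")))
        if build not in found:  # drop repeated log lines
            found.append(build)
    return found


def kill_order(dispatcher: Build, console_text: str) -> List[Build]:
    """Leaves first so the dispatcher's wait unblocks, dispatcher last."""
    return find_leaf_builds(console_text) + [dispatcher]


def kill_with_retry(
    targets: List[Build],
    is_running: Callable[[Build], bool],
    kill: Callable[[Build], None],
    attempts: int = 3,
    gap_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Build]:
    """Kill every target, retrying the survivors; return builds still running."""
    remaining = list(targets)
    for attempt in range(attempts):
        remaining = [b for b in remaining if is_running(b)]
        if not remaining:
            break
        for build in remaining:
            kill(build)
        if attempt < attempts - 1:
            sleep(gap_seconds)  # give builds caught mid-startup time to settle
    return [b for b in remaining if is_running(b)]
```

Passing `sleep` as a parameter keeps the retry gap testable without actually waiting 5 seconds.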

Failures inside the abort block are still swallowed and logged - we
never want stale cleanup to break a fresh build.
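
The "swallow and log" behaviour mentioned above might look roughly like the following Python sketch. The function name `run_best_effort` and the logger name are hypothetical, and the actual pipeline code is Groovy. The point is only that an exception raised during cleanup is logged and never propagated to the fresh build.

```python
import logging
from typing import Callable

log = logging.getLogger("stale-cleanup")  # hypothetical logger name


def run_best_effort(cleanup: Callable[[], None]) -> bool:
    """Run cleanup; log and swallow any failure so the caller keeps going."""
    try:
        cleanup()
        return True
    except Exception as exc:  # deliberately broad: cleanup must never break the build
        log.warning("stale-build cleanup failed, continuing: %s", exc)
        return False
```
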
@Alexey-Rivkin Alexey-Rivkin force-pushed the ci/abort-stale-with-leaves branch from 2abcb24 to 5be718f Compare June 8, 2026 07:25
@Alexey-Rivkin

Contributor Author

/build
